How do you write a fast, general-purpose AI "storytelling" project without training an RNN or a generative model?
The translated article follows:
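1. Blueprint: an overview of the whole project and its components.
2. Program demo: a preview of what the system does once the code has been written.
3. Data loading and cleaning: load the data and prepare it for processing.
4. Finding the most representative plots: the first part of the project, which uses K-Means to select the plots the user is most interested in.
5. Summarizing plots: use graph-based summarization to obtain a summary of each plot; the summaries are a component of the UI.
6. Recommendation engine: use a simple predictive machine learning model to recommend new stories.
7. Bringing it all together: write the ecosystem structure that ties all the components together.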
import pandas as pd
import numpy as np # used by the encoding, imputation and clustering steps below
data = pd.read_csv('/kaggle/input/wikipedia-movie-plots/wiki_movie_plots_deduped.csv')
data.head()
data.drop(['Director','Cast'],axis=1,inplace=True)
"Grace Roberts (played by Lea Leland), marries rancher Edward Smith, who is revealed to be a neglectful, vice-ridden spouse. They have a daughter, Vivian. Dr. Franklin (Leonid Samoloff) whisks Grace away from this unhappy life, and they move to New York under aliases, pretending to be married (since surely Smith would not agree to a divorce). Grace and Franklin have a son, Walter (Milton S. Gould). Vivian gets sick, however, and Grace and Franklin return to save her. Somehow this reunion, as Smith had assumed Grace to be dead, causes the death of Franklin. This plot device frees Grace to return to her father's farm with both children.[1]"
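# citation markers such as [1] that Wikipedia leaves in the plot text
blacklist = []
for i in range(100):
    blacklist.append('['+str(i)+']')
def remove_brackets(string):
    # strip every citation marker from one plot
    for item in blacklist:
        string = string.replace(item,'')
    return string
data['Plot'] = data['Plot'].apply(remove_brackets)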
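As an alternative sketch, the same citation markers could be stripped in one pass with a regular expression instead of looping over a blacklist. The name `remove_citations` is illustrative and not part of the original notebook; unlike the blacklist, this pattern would also remove markers above [99].

```python
import re

# Matches Wikipedia-style citation markers such as [1] or [42].
CITATION = re.compile(r'\[\d+\]')

def remove_citations(text):
    """Return text with every [number] marker removed."""
    return CITATION.sub('', text)
```

It could be applied the same way as the function above, for example `data['Plot'].apply(remove_citations)`.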
import gensim # needs gensim < 4.0; the summarization module was removed in 4.0
string = '''
The PageRank algorithm outputs a probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page. PageRank can be calculated for collections of documents of any size. It is assumed in several research papers that the distribution is evenly divided among all documents in the collection at the beginning of the computational process. The PageRank computations require several passes, called “iterations”, through the collection to adjust approximate PageRank values to more closely reflect the theoretical true value.
Assume a small universe of four web pages: A, B, C and D. Links from a page to itself, or multiple outbound links from one single page to another single page, are ignored. PageRank is initialized to the same value for all pages. In the original form of PageRank, the sum of PageRank over all pages was the total number of pages on the web at that time, so each page in this example would have an initial value of 1. However, later versions of PageRank, and the remainder of this section, assume a probability distribution between 0 and 1. Hence the initial value for each page in this example is 0.25.
The PageRank transferred from a given page to the targets of its outbound links upon the next iteration is divided equally among all outbound links.
If the only links in the system were from pages B, C, and D to A, each link would transfer 0.25 PageRank to A upon the next iteration, for a total of 0.75.
Suppose instead that page B had a link to pages C and A, page C had a link to page A, and page D had links to all three pages. Thus, upon the first iteration, page B would transfer half of its existing value, or 0.125, to page A and the other half, or 0.125, to page C. Page C would transfer all of its existing value, 0.25, to the only page it links to, A. Since D had three outbound links, it would transfer one third of its existing value, or approximately 0.083, to A. At the completion of this iteration, page A will have a PageRank of approximately 0.458.
In other words, the PageRank conferred by an outbound link is equal to the document’s own PageRank score divided by the number of outbound links L( ).
In the general case, the PageRank value for any page u can be expressed as: i.e. the PageRank value for a page u is dependent on the PageRank values for each page v contained in the set Bu (the set containing all pages linking to page u), divided by the number L(v) of links from page v. The algorithm involves a damping factor for the calculation of the pagerank. It is like the income tax which the govt extracts from one despite paying him itself.
'''
print(gensim.summarization.summarize(string))
In the original form of PageRank, the sum of PageRank over all pages was the total number of pages on the web at that time, so each page in this example would have an initial value of 1.
The PageRank transferred from a given page to the targets of its outbound links upon the next iteration is divided equally among all outbound links. If the only links in the system were from pages B, C, and D to A, each link would transfer 0.25 PageRank to A upon the next iteration, for a total of 0.75. Since D had three outbound links, it would transfer one third of its existing value, or approximately 0.083, to A.
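To make the idea of graph-based summarization concrete, here is a minimal TextRank-style sketch in pure Python: sentences are nodes, edges are weighted by word overlap, and a PageRank iteration scores each sentence. The sentence splitter, the similarity measure, and all function names are simplifying assumptions for illustration; this is not gensim's actual implementation.

```python
import math
import re

def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def similarity(a, b):
    # Shared words, normalized by the log of each sentence's vocabulary size.
    wa = set(re.findall(r'\w+', a.lower()))
    wb = set(re.findall(r'\w+', b.lower()))
    if len(wa) < 2 or len(wb) < 2:
        return 0.0
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))

def rank_sentences(sentences, damping=0.85, iterations=50):
    # Weighted PageRank by power iteration over the sentence graph.
    n = len(sentences)
    weights = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            incoming = sum(weights[j][i] / out_sum[j] * scores[j]
                           for j in range(n) if out_sum[j] > 0)
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores

def summarize_sketch(text, keep=2):
    # Keep the `keep` best-scoring sentences, in their original order.
    sentences = split_sentences(text)
    if len(sentences) <= keep:
        return text
    scores = rank_sentences(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:keep]
    return ' '.join(sentences[i] for i in sorted(top))
```

A sentence that shares no words with the rest of the text receives only the baseline score, so it is the first to be left out of the summary.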
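If the text is shorter than 500 characters, return the original text unchanged; summarizing it would make the content too short.
If the text consists of a single sentence, gensim cannot process it, because it can only select the important sentences from a text. We use a TextBlob object, whose .sentences attribute splits the text into sentences. If the first sentence equals the text itself, the text contains only one sentence.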
import gensim
from textblob import TextBlob
def summary(x):
    # keep short or single-sentence plots as they are
    if len(x) < 500 or str(TextBlob(x).sentences[0]) == x:
        return x
    else:
        return gensim.summarization.summarize(x)
data['Summary'] = data['Plot'].apply(summary)
"The earliest known adaptation of the classic fairytale, this films shows Jack trading his cow for the beans, his mother forcing him to drop them in the front yard, and beig forced upstairs. As he sleeps, Jack is visited by a fairy who shows him glimpses of what will await him when he ascends the bean stalk. In this version, Jack is the son of a deposed king. When Jack wakes up, he finds the beanstalk has grown and he climbs to the top where he enters the giant's home. The giant finds Jack, who narrowly escapes. The giant chases Jack down the bean stalk, but Jack is able to cut it down before the giant can get to safety. He falls and is killed as Jack celebrates. The fairy then reveals that Jack may return home as a prince."
'As he sleeps, Jack is visited by a fairy who shows him glimpses of what will await him when he ascends the bean stalk.'
import string
import re
def clean(text):
    # remove punctuation (escaped so ], \ and ^ are matched literally) and lowercase
    return re.sub('[%s]' % re.escape(string.punctuation),'',text).lower()
data['Cleaned'] = data['Plot'].apply(clean)
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words='english',max_features=500)
X = vectorizer.fit_transform(data['Plot'])
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler(with_mean=False) # a sparse TF-IDF matrix cannot be centered
X = scaler.fit_transform(X)
n_clusters = []
scores = []
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# try several cluster counts and record the silhouette score of each
for n in [3,4,5,6]:
    kmeans = KMeans(n_clusters=n)
    kmeans.fit(X)
    scores.append(silhouette_score(X,kmeans.predict(X)))
    n_clusters.append(n)
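The loop above compares cluster counts by silhouette score. As a rough illustration of what that score measures, here is a pure-Python sketch of the mean silhouette coefficient on a handful of made-up 2D points; the points, labels, and function name are assumptions for demonstration only, and scikit-learn's `silhouette_score` should be used on real data.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient of a labeling (needs at least 2 clusters)."""
    def mean_dist(p, members):
        return sum(math.dist(p, q) for q in members) / len(members)

    total = 0.0
    for i, (p, label) in enumerate(zip(points, labels)):
        own = [q for j, q in enumerate(points) if labels[j] == label and j != i]
        if not own:            # singleton cluster contributes 0
            continue
        a = mean_dist(p, own)  # cohesion: mean distance inside own cluster
        b = min(               # separation: mean distance to the nearest other cluster
            mean_dist(p, [q for j, q in enumerate(points) if labels[j] == other])
            for other in set(labels) if other != label
        )
        total += (b - a) / max(a, b)
    return total / len(points)

# Two well-separated toy groups (hypothetical data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]   # matches the visible groups
bad_labels = [0, 1, 0, 1, 0, 1]    # mixes the groups
```

A labeling that matches the real groups scores close to 1, while a mixed labeling scores much lower. With the lists built in the loop above, the best cluster count could then be read off as `n_clusters[scores.index(max(scores))]`.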
data['Age'] = data['Release Year'].apply(lambda x:2017-x)
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')
nation = enc.fit_transform(np.array(data['Origin/Ethnicity']).reshape(-1, 1)).toarray()
for i in range(len(nation[0])):
    data[enc.categories_[0][i]] = nation[:,i]
data['Genre'].value_counts()
top_genres = data['Genre'].value_counts().head(21).index.tolist() # 21 most frequent genres
top_genres.remove('unknown')
def process(genre):
    # keep the common genres, fold everything else into 'unknown'
    if genre in top_genres:
        return genre
    else:
        return 'unknown'
data['Genre'] = data['Genre'].apply(process)
enc1 = OneHotEncoder(handle_unknown='ignore')
genres = enc1.fit_transform(np.array(data['Genre']).reshape(-1, 1)).toarray()
for i in range(len(genres[0])):
    data[enc1.categories_[0][i]] = genres[:,i]
# for films of unknown genre, blank the genre columns so they can be imputed later
for i in data[data['unknown']==1].index:
    for column in ['action',
                   'adventure', 'animation', 'comedy', 'comedy, drama', 'crime',
                   'crime drama', 'drama', 'film noir', 'horror', 'musical', 'mystery', 'romance', 'romantic comedy', 'sci-fi', 'science fiction', 'thriller', 'unknown', 'war', 'western']:
        data.loc[i,column] = np.nan
import re
data['Cleaned'] = data['Plot'].apply(lambda x:re.sub('[^A-Za-z0-9]+',' ',str(x)).lower())
from sklearn.feature_extraction.text import TfidfVectorizer
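vectorizer = TfidfVectorizer(stop_words='english',max_features=30)
X = vectorizer.fit_transform(data['Cleaned']).toarray()
# vocabulary_ maps each word to its column index; sort by that index so names match columns
keys = sorted(vectorizer.vocabulary_,key=vectorizer.vocabulary_.get)
for i in range(len(keys)):
    data[keys[i]] = X[:,i]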
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
column_list = ['Age', 'American', 'Assamese','Australian', 'Bangladeshi', 'Bengali', 'Bollywood', 'British','Canadian', 'Chinese', 'Egyptian', 'Filipino', 'Hong Kong', 'Japanese','Kannada', 'Malayalam', 'Malaysian', 'Maldivian', 'Marathi', 'Punjabi','Russian', 'South_Korean', 'Tamil', 'Telugu', 'Turkish','man', 'night', 'gets', 'film', 'house', 'takes', 'mother', 'son','finds', 'home', 'killed', 'tries', 'later', 'daughter', 'family','life', 'wife', 'new', 'away', 'time', 'police', 'father', 'friend','day', 'help', 'goes', 'love', 'tells', 'death', 'money', 'action', 'adventure', 'animation', 'comedy', 'comedy, drama', 'crime','crime drama', 'drama', 'film noir', 'horror', 'musical', 'mystery','romance', 'romantic comedy', 'sci-fi', 'science fiction', 'thriller','war', 'western']
imputed = imputer.fit_transform(data[column_list])
for i in range(len(column_list)):
    data[column_list[i]] = imputed[:,i]
# 'Director' and 'Cast' were already dropped earlier; the genre placeholder column is lowercase 'unknown'
data.drop(['Title','Release Year','Wiki Page','Origin/Ethnicity','unknown','Genre'],axis=1,inplace=True)
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
Xcluster = data.drop(['Plot','Summary','Cleaned'],axis=1)
scores = []
for i in [3,4,5,6]:
    kmeans = KMeans(n_clusters=i)
    prediction = kmeans.fit_predict(Xcluster)
    scores.append(silhouette_score(Xcluster,prediction))
from sklearn.cluster import KMeans
Xcluster = data.drop(['Plot','Summary','Cleaned'],axis=1)
kmeans = KMeans(n_clusters=3)
kmeans.fit(Xcluster)
pd.Series(kmeans.predict(Xcluster)).value_counts()
centers = kmeans.cluster_centers_
centers
Xcluster['Label'] = kmeans.labels_
for cluster in [0,1,2]:
    # rows assigned to this cluster, without the label column
    subset = Xcluster[Xcluster['Label']==cluster].drop(['Label'],axis=1)
    center = centers[cluster]
    # Euclidean distance from every row to the cluster center
    distances = pd.Series(np.linalg.norm(subset.values-center,axis=1),index=subset.index)
    # the story closest to the center is the most representative one
    print('Cluster',cluster,':',distances[distances==distances.min()].index.tolist())
data.loc[4114]['Summary']
'On a neutral island in the Pacific called Shadow Island (above the island of Formosa), run by American gangster Lucky Kamber, both sides in World War II attempt to control the secret of element 722, which can be used to create synthetic aviation fuel.'
data.loc[15176]['Summary']
'Jake Rodgers (Cedric the Entertainer) wakes up near a dead body. Freaked out, he is picked up by Diane.'
data.loc[9761]['Summary']
'Jewel thief Jack Rhodes, a.k.a. "Jack of Diamonds", is masterminding a heist of $30 million worth of uncut gems. He also has his eye on lovely Gillian Bromley, who becomes a part of the gang he is forming to pull off the daring robbery. However, Chief Inspector Cyril Willis from Scotland Yard is blackmailing Gillian, threatening her with prosecution on another theft if she doesn\'t cooperate in helping him bag the elusive Rhodes, the last jewel in his crown before the Chief Inspector formally retires from duty.'
import time
starting = []
print("Indicate if you like (1) or dislike (0) the following three story snapshots.")
print("\n> > > 1 < < <")
print('On a neutral island in the Pacific called Shadow Island (above the island of Formosa), run by American gangster Lucky Kamber, both sides in World War II attempt to control the secret of element 722, which can be used to create synthetic aviation fuel.')
time.sleep(0.5) #Kaggle sometimes has a glitch with inputs
while True:
    response = input(':: ')
    try:
        if int(response) == 0 or int(response) == 1:
            starting.append(int(response))
            break
        else:
            print('Invalid input. Try again')
    except ValueError: # the reply was not a number
        print('Invalid input. Try again')
print('\n> > > 2 < < <')
print('Jake Rodgers (Cedric the Entertainer) wakes up near a dead body. Freaked out, he is picked up by Diane.')
time.sleep(0.5) #Kaggle sometimes has a glitch with inputs
while True:
    response = input(':: ')
    try:
        if int(response) == 0 or int(response) == 1:
            starting.append(int(response))
            break
        else:
            print('Invalid input. Try again')
    except ValueError: # the reply was not a number
        print('Invalid input. Try again')
print('\n> > > 3 < < <')
print("Jewel thief Jack Rhodes, a.k.a. 'Jack of Diamonds', is masterminding a heist of $30 million worth of uncut gems. He also has his eye on lovely Gillian Bromley, who becomes a part of the gang he is forming to pull off the daring robbery. However, Chief Inspector Cyril Willis from Scotland Yard is blackmailing Gillian, threatening her with prosecution on another theft if she doesn't cooperate in helping him bag the elusive Rhodes, the last jewel in his crown before the Chief Inspector formally retires from duty.")
time.sleep(0.5) #Kaggle sometimes has a glitch with inputs
while True:
    response = input(':: ')
    try:
        if int(response) == 0 or int(response) == 1:
            starting.append(int(response))
            break
        else:
            print('Invalid input. Try again')
    except ValueError: # the reply was not a number
        print('Invalid input. Try again')
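The three prompts above repeat the same validation loop. One way it could be factored out is a small helper like the sketch below; `ask_choice` and its `read` parameter are hypothetical names introduced here (the parameter exists only so the function can be exercised without a keyboard) and are not part of the original notebook.

```python
def ask_choice(prompt=':: ', allowed=(0, 1), read=input):
    """Keep asking until the reply parses as an integer contained in `allowed`."""
    while True:
        reply = read(prompt)
        try:
            value = int(reply)
        except ValueError:
            # the reply was not a number at all
            print('Invalid input. Try again')
            continue
        if value in allowed:
            return value
        # a number, but not one of the accepted answers
        print('Invalid input. Try again')
```

Each prompt would then reduce to something like `starting.append(ask_choice())`, and a prompt that also accepts a quit signal could pass `allowed=(-1, 0, 1)`.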
# same order in which the three snapshots were shown, so the rows line up with the answers in y
X = data.loc[[4114,15176,9761]].drop(['Plot','Summary','Cleaned'],axis=1)
y = starting
data.drop([4114,15176,9761],inplace=True)
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import shuffle
subset = data.drop(['Plot','Summary','Cleaned'],axis=1)
while True:
    dec = DecisionTreeClassifier().fit(X,y)
    subdf = shuffle(subset).head(10_000) #select about 1/3 of data
    # probability of "like" for every candidate; 0 if the user has not liked anything yet
    if 1 in dec.classes_:
        probability = dec.predict_proba(subdf)[:,list(dec.classes_).index(1)]
    else:
        probability = np.zeros(len(subdf))
    # first candidate with the highest probability
    index = subdf.index[probability.argmax()]
    print('> > > Would you be interested in this snippet from a story? (1/0/-1 to quit) < < <')
    print(data.loc[index]['Summary'])
    time.sleep(0.5)
    while True:
        response = input(':: ')
        try:
            if int(response) in (-1,0,1): # -1 must be accepted, otherwise there is no way to quit
                response = int(response)
                break
            else:
                print('Invalid input. Try again')
        except ValueError:
            print('Invalid input. Try again')
    if response == -1:
        break
    # add the shown story to the training data
    X = pd.concat([X,subset.loc[[index]]])
    if response == 0:
        y.append(0)
    else:
        print('\n> > > Printing full story. < < <')
        print(data.loc[index]['Plot'])
        time.sleep(2)
        print("\n> > > Did you enjoy this story? (1/0) < < <")
        while True:
            response = input(':: ')
            try:
                if int(response) == 0 or int(response) == 1:
                    response = int(response)
                    break
                else:
                    print('Invalid input. Try again')
            except ValueError:
                print('Invalid input. Try again')
        if response == 1:
            y.append(1)
        else:
            y.append(0)
    # remove the story from both frames so it is not recommended again
    data.drop(index,inplace=True)
    subset.drop(index,inplace=True)